7 Applied example

In this section, I will illustrate the use of plotscaper, showcasing the workflow and the various features of the package.

The Institute of Health Information and Statistics of the Czech Republic (IHIS, ÚZIS in Czech) is a government agency established by the Czech Ministry of Health. Its primary role is to collect, process, and report on medical data within Czechia (ÚZIS 2024). Notably, the institute provides high-quality, open-access medical data, including information about the use and manufacture of medicines, summaries of fiscal and employment records in medical facilities, and various epidemiological datasets.

7.0.1 The data set

The data set (Soukupová et al. 2023) contains longitudinal information about psychiatric care in Czechia. More specifically, it contains aggregated data on individuals released from long-term psychiatric care facilities between 2010 and 2022. It includes information such as the region of the treatment facility, the sex of the patients, age category, diagnosis based on the international ICD-10 classification (World Health Organization 2024a, 2024b), the number of hospitalizations, and the total number of days spent in care by the given subset of patients.

Here’s the data set at a quick glance:

df <- read.csv("./data/longterm_care.csv")
dplyr::glimpse(df)
## Rows: 68,115
## Columns: 12
## $ year                   <int> 2019, 2016, 2011, 2013, 2019, 2013, 2018, 2017,…
## $ region_code            <chr> "CZ071", "CZ064", "CZ080", "CZ072", "CZ080", "C…
## $ region                 <chr> "Olomoucký kraj", "Jihomoravský kraj", "Moravsk…
## $ sex                    <chr> "female", "male", "male", "female", "male", "fe…
## $ diagnosis              <chr> "f10", "f2", "f7", "f4 without f42", "f60–f61",…
## $ reason_for_termination <chr> "release", "early termination", "early terminat…
## $ age_category           <chr> "40–49", "30–39", "30–39", "40–49", "50–59", "4…
## $ stay_category          <chr> "short-term", "medium-term", "short-term", "sho…
## $ field                  <chr> "psychiatry", "psychiatry", "psychiatry", "psyc…
## $ care_category          <chr> "adult", "adult", "adult", "adult", "adult", "a…
## $ cases                  <int> 13, 3, 2, 3, 1, 2, 2, 32, 2, 2, 1, 12, 1, 1, 2,…
## $ days                   <int> 196, 345, 38, 108, 1, 47, 319, 3813, 120, 256, …

The data set contains over 68,000 rows, totalling over 410,000 hospitalizations. Each row records the number of patients with a particular set of characteristics released from a treatment facility during a given year, and the total number of days those patients spent in treatment.
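Because the data are aggregated, the number of rows and the number of hospitalizations are different quantities: the first counts patient subgroups, the second sums the cases column. The following is a minimal sketch of how such totals can be computed. It uses a toy data frame (the name `toy` is purely illustrative) holding only the first three rows visible in the glimpse output above, so its totals are not those of the full data set.

```r
# Toy stand-in for `df`: the first three rows from the glimpse output above
toy <- data.frame(
  year  = c(2019L, 2016L, 2011L),
  cases = c(13L, 3L, 2L),
  days  = c(196L, 345L, 38L)
)

n_rows  <- nrow(toy)       # number of aggregated patient subgroups (rows)
n_cases <- sum(toy$cases)  # total hospitalizations across all subgroups
n_days  <- sum(toy$days)   # total days spent in care across all subgroups
```

Applying the same two calls, `nrow()` and `sum()`, to the full `df` should give the row and hospitalization counts quoted above.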

The original data set used Czech column names and category labels. To make the analysis more accessible to non-Czech speakers, I took the liberty of translating most of these to English (excluding the region variable). The translation script is available in the thesis repository, at the following path: ./data/longterm_care_translate.R. Additionally, the data set website contains a JSON schema with a text description of each of the variables (Soukupová et al. 2023). I translated these descriptions as well and provide them in Table 7.1:

Table 7.1: Schema of the long-term care data set, including the original column names (Czech), as well as translated names and descriptions.

| Translated name | Original name | Description |
|---|---|---|
| year | rok | The year hospitalization was terminated |
| region_code | kraj_kod | Region code based on the NUTS 3 classification |
| region | kraj_nazev | Region where the facility was located |
| sex | pohlavi | Classification of patients’ sex |
| diagnosis | zakladni_diagnoza | The primary diagnosis of the psychiatric disorder based on the ICD-10 classification |
| reason_for_termination | ukonceni | The reason for termination of care |
| age_category | vekova_kategorie | Classification of patients’ age category |
| stay_category | kategorie_delky_hospitalizace | Classification of hospitalization based on length: short-term (< 3 months), medium-term (3-6 months), and long-term (6+ months) |
| field | obor | The field of psychiatric care |
| care_category | kategorie_pece | Classification of care: child or adult |
| cases | pocet_hospitalizaci | The total number of cases/hospitalizations in the given subgroup of patients |
| days | delka_hospitlizace | The total time spent in care, in days (= sum of the number of days all patients in a given subgroup spent in care) |
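The column renaming summarized in Table 7.1 can be sketched in a few lines of base R. This is only an illustration of the idea, not the contents of the actual translation script: the names `col_map` and `raw` are hypothetical, the toy input holds a single row with three of the twelve columns, and the translation of category labels is not covered.

```r
# Czech -> English column names, copied from Table 7.1
col_map <- c(
  rok                           = "year",
  kraj_kod                      = "region_code",
  kraj_nazev                    = "region",
  pohlavi                       = "sex",
  zakladni_diagnoza             = "diagnosis",
  ukonceni                      = "reason_for_termination",
  vekova_kategorie              = "age_category",
  kategorie_delky_hospitalizace = "stay_category",
  obor                          = "field",
  kategorie_pece                = "care_category",
  pocet_hospitalizaci           = "cases",
  delka_hospitlizace            = "days"
)

# Toy input with original (Czech) column names; values taken from the
# first row of the glimpse output above
raw <- data.frame(rok = 2019L, kraj_kod = "CZ071", pocet_hospitalizaci = 13L)

# Rename only the columns that appear in the mapping; leave others untouched
known <- names(raw) %in% names(col_map)
names(raw)[known] <- unname(col_map[names(raw)[known]])
```

Keeping the mapping as a single named vector means the table and the renaming step can be checked against each other directly.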

7.0.2 Interactive exploration

Now it is time to start exploring the data using plotscaper.

The first thing to note about the data set is that the data has been aggregated, such that each row represents the combined number of releases within a given subset of patients. For example, the first row records that, in the year 2019, there were 13 females aged 40-49 who were released from treatment facilities in Olomoucký kraj, after having been in short-term care with an F10 ICD-10 diagnosis (mental and behavioural disorders due to use of alcohol, World Health Organization 2024a) for a sum total of 196 days:

df[1, ]
##   year region_code         region    sex diagnosis reason_for_termination
## 1 2019       CZ071 Olomoucký kraj female       f10                release
##   age_category stay_category      field care_category cases days
## 1        40–49    short-term psychiatry         adult    13  196

Thus, the two primary continuous variables in the data are cases (the number of patients in a given subgroup released from care) and days (the number of days the given subgroup of patients spent in care). Intuitively, we should expect a fairly linear relationship between these variables, such that a larger group of patients should spend a greater number of days in care, in total. We can use plotscaper to visualize this relationship via a scatterplot:

library(plotscaper) # Load in the package

create_schema(df) |> # Create a schema that can be modified declaratively
  add_scatterplot(c("cases", "days")) |>
  render() # Render the figure

Interestingly, there seems to be a leaf-shaped pattern in the data, with three distinct “leaflets”, each suggesting a linear relationship. If we look at the data, we can see that the stay_category variable has three levels, corresponding to short-term (< 3 months), medium-term (3-6 months), and long-term (6+ months) care. If we mark the cases corresponding to the three categories in different colors, we can see that these indeed correspond to the three leaflets:

df |>
  create_schema() |>
  add_scatterplot(c("cases", "days")) |>
  add_barplot(c("stay_category", "cases")) |>
  assign_cases(which(df$stay_category == "short-term"), 1) |>
  assign_cases(which(df$stay_category == "long-term"), 2) |>
  render()

However, this does not really explain the absence of points between the different leaflets: if the distribution of cases and days within each of the three stay_category levels were uniform, we should expect to see more points in the gaps between the leaflets. This suggests that there may be a kind of selection process at play, where patients are less likely to be released after stays whose length is close to the category boundaries. We can see this more easily if we plot the average number of days a group of patients spent in care:

df$avg_days <- df$days / df$cases

df |>
  create_schema() |>
  add_scatterplot(c("cases", "avg_days")) |>
  add_barplot(c("stay_category", "cases")) |>
  assign_cases(which(df$stay_category == "short-term"), 1) |>
  assign_cases(which(df$stay_category == "long-term"), 2) |>
  set_scale("scatterplot1", "y", transformation = "log10", default = TRUE) |>
  render()

Now we can see the gaps between the three different distributions along the y-axis. By pressing the Q key and hovering over the points near these gaps, we can see that there are only a few patient groups where the average patient spends between 60-90 or 150-210 days in care (corresponding to gaps at 2-3 months and 5-7 months). This suggests that, perhaps, if a patient spends 2 months in care, the healthcare providers are more likely to keep them in care longer by transferring them to medium-term care, and likewise, if a patient spends 5 months in care, they are more likely to be transferred to long-term care (or, alternatively, the patients may be released early).
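As a rough, non-interactive complement to hovering over the points, the average stay can also be binned at the interval boundaries mentioned above (60, 90, 150, and 210 days) and the groups in each bin counted. The sketch below does this for a toy data frame holding only the first ten cases/days pairs visible in the glimpse output; the names `toy` and `gap_counts`, as well as the bin labels, are illustrative assumptions. With so few rows, the counts say nothing about the full data set; they only show the mechanics.

```r
# Toy stand-in for `df`: first ten cases/days pairs from the glimpse output
toy <- data.frame(
  cases = c(13, 3, 2, 3, 1, 2, 2, 32, 2, 2),
  days  = c(196, 345, 38, 108, 1, 47, 319, 3813, 120, 256)
)

# Average number of days per hospitalization within each subgroup
toy$avg_days <- toy$days / toy$cases

# Bin the averages; intervals are closed on the left, e.g. [60, 90)
bins <- cut(
  toy$avg_days,
  breaks = c(0, 60, 90, 150, 210, Inf),
  labels = c("< 60", "60-90 (gap)", "90-150", "150-210 (gap)", "210+"),
  right  = FALSE
)

# Number of patient groups falling into each bin (empty bins are kept)
gap_counts <- table(bins)
```

Running the same binning on the full `df` would give a numeric summary to set beside the scatterplot, although it averages within groups and so cannot distinguish individual patients' stays.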

References

Soukupová, J, H Melicharová, O Šanca, V Bartůněk, J Jarkovský, and M Komenda. 2023. “Dlouhodobá psychiatrická péče.” NZIP. https://www.nzip.cz/data/2060-dlouhodoba-psychiatricka-pece.
ÚZIS. 2024. “About us - ÚZIS ČR.” https://www.uzis.cz/index-en.php?pg=about-us.
World Health Organization. 2024a. “ICD-10 Version:2019.” https://icd.who.int/browse10/2019/en.
World Health Organization. 2024b. “International Classification of Diseases (ICD).” https://www.who.int/standards/classifications/classification-of-diseases.